{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# TF-IDF Content-Based Recommendation on the COVID-19 Open Research Dataset\n",
                "This notebook demonstrates a simple implementation of Term Frequency-Inverse Document Frequency (TF-IDF) content-based recommendation on the [COVID-19 Open Research Dataset](https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/), hosted through Azure Open Datasets.\n",
                "\n",
                "We will build a recommender that returns the top k articles most similar to any article of interest (the query item) in the COVID-19 Open Research Dataset."
            ]
        },
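        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To make the idea concrete, here is a minimal, standard-library sketch of TF-IDF weighting followed by cosine-similarity ranking. It is an illustration only, not the implementation used by `TfidfRecommender`: the toy corpus, the helper names (`tfidf_vectors`, `cosine`, `top_k`), the whitespace tokenization, and the smoothed IDF formula are all assumptions chosen for brevity.\n",
                "\n",
                "```python\n",
                "import math\n",
                "from collections import Counter\n",
                "\n",
                "# Toy corpus (hypothetical titles, not taken from the dataset)\n",
                "docs = {\n",
                "    'a': 'coronavirus vaccine trial results',\n",
                "    'b': 'vaccine trial for influenza',\n",
                "    'c': 'spatial clustering of hospital cases',\n",
                "}\n",
                "\n",
                "\n",
                "def tfidf_vectors(docs):\n",
                "    # Tokenize on whitespace and count terms per document (term frequency)\n",
                "    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}\n",
                "    n_docs = len(docs)\n",
                "    # Document frequency: number of documents that contain each term\n",
                "    df = Counter(term for c in counts.values() for term in c)\n",
                "    # Smoothed inverse document frequency (assumed variant)\n",
                "    idf = {term: math.log((1 + n_docs) / (1 + n)) + 1 for term, n in df.items()}\n",
                "    return {doc_id: {t: tf * idf[t] for t, tf in c.items()} for doc_id, c in counts.items()}\n",
                "\n",
                "\n",
                "def cosine(u, v):\n",
                "    # Cosine similarity between two sparse vectors stored as dicts\n",
                "    dot = sum(w * v.get(t, 0.0) for t, w in u.items())\n",
                "    norm_u = math.sqrt(sum(w * w for w in u.values()))\n",
                "    norm_v = math.sqrt(sum(w * w for w in v.values()))\n",
                "    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0\n",
                "\n",
                "\n",
                "def top_k(query_id, vectors, k=2):\n",
                "    # Score every other document against the query and keep the k best\n",
                "    scores = [\n",
                "        (other, cosine(vectors[query_id], vec))\n",
                "        for other, vec in vectors.items()\n",
                "        if other != query_id\n",
                "    ]\n",
                "    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]\n",
                "\n",
                "\n",
                "vectors = tfidf_vectors(docs)\n",
                "print(top_k('a', vectors, k=2))\n",
                "```\n",
                "\n",
                "In this toy corpus, document `b` shares the terms `vaccine` and `trial` with the query `a`, so it ranks first, while `c` has no terms in common with `a` and scores zero. The rest of the notebook applies the same idea to the full text of real articles."
            ]
        },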
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {
                "scrolled": true
            },
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "System version: 3.6.11 | packaged by conda-forge | (default, Nov 27 2020, 18:57:37) \n",
                        "[GCC 9.3.0]\n"
                    ]
                }
            ],
            "source": [
                "import sys\n",
                "\n",
                "from recommenders.datasets import covid_utils\n",
                "from recommenders.models.tfidf.tfidf_utils import TfidfRecommender\n",
                "\n",
                "# Print version\n",
                "print(f\"System version: {sys.version}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 1. Load the dataset into a dataframe\n",
                "Let's begin by loading the dataset's metadata file into a pandas DataFrame. This file describes each of the scientific articles included in the full dataset."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "/home/scgraham/miniconda3/envs/reco_base/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (13,14) have mixed types.Specify dtype option on import or set low_memory=False.\n",
                        "  if (await self.run_code(code, result,  async_=asy)):\n"
                    ]
                }
            ],
            "source": [
                "# Specify container and metadata filename\n",
                "container_name = 'covid19temp'\n",
                "metadata_filename = 'metadata.csv'\n",
                "sas_token = ''  # See the Azure Open Datasets notebook for the SAS token\n",
                "\n",
                "# Load the metadata (may take around 1-2 minutes)\n",
                "metadata = covid_utils.load_pandas_df(\n",
                "    container_name=container_name,\n",
                "    metadata_filename=metadata_filename,\n",
                "    azure_storage_sas_token=sas_token,\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 2. Extract articles in the public domain\n",
                "The dataset contains articles released under a variety of licenses. We will use only the articles that are in the public domain ([cc0](https://creativecommons.org/publicdomain/zero/1.0/))."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "<AxesSubplot:title={'center':'License'}>"
                        ]
                    },
                    "execution_count": 3,
                    "metadata": {},
                    "output_type": "execute_result"
                },
                {
                    "data": {
                        "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAE5CAYAAACQ6Vd4AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAqaUlEQVR4nO3de7hcVX3/8feHBJC7XBKEBEmUFAooIAGiUkRSJVYRRMBQ0FTSRhAVflW52PbxgtGgRSpUsCm3gEKIoCWoqGm4WAoNHAQJ4VJSiJCCJMhFtIIEvr8/1hoyZzI5Z2bPOucMJ5/X88wzM2tmf8+amX32d++11l5bEYGZmdl6Q10BMzPrDk4IZmYGOCGYmVnmhGBmZoATgpmZZU4IZmYGOCGYrUHSn0l6YKjrYTbY5PMQbF0maRnw1xHx70NdF7Oh5iMEMzMDnBDM1iDpQEnL657vIOn7klZK+o2kf6577ThJ90l6WtJPJe1Y91pIOl7Sg/n1b0lSfm0nSTdJelbSk5KurFtuF0kLJD0l6QFJRw3WZ7d1mxOCWR8kjQB+CPwKGAeMAebm1w4DPgccDowC/gO4oiHE+4B9gD2Ao4CDc/kZwM+ALYGxwLk55ibAAuByYDRwNHCepN0G4OOZ9eKEYNa3fYHtgc9GxO8j4vmIuDm/9jHgqxFxX0SsAr4C7Fl/lADMiohnIuIR4AZgz1z+IrAjsH1DzPcByyLi4ohYFRG/AK4GjhjQT2mGE4JZf3YAfpU3+I12BL4p6RlJzwBPASIdRdT8uu7x/wGb5sen5PfeJmmJpOPqYu5Xi5njHgO8rtQHMlubkUNdAbMu9yjwekkjmySFR4GZEfHddoNGxK+BvwGQtD/w75J+nmPeFBHv6rDeZm3zEYIZrC/pNbUbvXeUbgMeB2ZJ2iS/5+35tW8Dp9fa9yVtIenIVv6gpCMljc1PnwYCeInUX/Enkj4saf1820fSnxb4nGZ9ckIwgx8Df6i7faH2QkS8BBwC7AQ8AiwHPpRf+wFwJjBX0m+Be4D3tPg39wEWSfodMB84KSIejojngHcDU4HHSE1OZwIbdvYRzfrnE9PMzAzwEYKZmWVOCGZmBjghmJlZ5oRgZmaAE4KZmWWv2hPTttlmmxg3btxQV8PM7FXljjvueDIiRjV77VWbEMaNG0dPT89QV8PM7FVF0q/W9pqbjMzMDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8tetSemNTPutB+19L5ls947wDUxM3v18RGCmZkBTghmZpY5IZiZGeCEYGZmmROCmZkBTghmZpY5IZiZGeCEYGZmmROCmZkBTghmZpY5IZiZGeCEYGZmmROCmZkBTghmZpa1lBAkvVbSVZLul3SfpLdK2krSAkkP5vst695/uqSlkh6QdHBd+d6SFufXzpGkXL6hpCtz+SJJ44p/UjMz61OrRwjfBH4SEbsAewD3AacBCyNiArAwP0fSrsBUYDdgCnCepBE5zvnADGBCvk3J5dOBpyNiJ+Bs4MwOP5eZmbWp34QgaXPgAOBCgIj4Y0Q8AxwKzMlvmwMclh8fCsyNiBci4mFgKbCvpO2AzSPi1ogI4NKGZWqxrgIm144ezMxscLRyhPAGYCVwsaQ7JV0gaRNg24h4HCDfj87vHwM8Wrf88lw2Jj9uLO+1TESsAp4Ftq70iczMrJJWEsJI4C3A+RGxF/B7cvPQWjTbs48+yvtapndgaYakHkk9K1eu7LvWZmbWllYSwnJgeUQsys+vIiWIJ3IzEPl+Rd37d6hbfizwWC4f26S81zKSRgJbAE81ViQiZkfExIiYOGrUqBaqbmZmreo3IUTEr4FHJe2ciyYD9wLzgWm5bBpwTX48H5iaRw6NJ3Ue35ablZ6TNCn3D3ykYZlarCOA63M/g5mZDZKRLb7vk8B3JW0APAR8lJRM5kmaDjwCHAkQEU
skzSMljVXAiRHxUo5zAnAJsBFwXb5B6rC+TNJS0pHB1A4/l5mZtamlhBARdwETm7w0eS3vnwnMbFLeA+zepPx5ckIxM7Oh4TOVzcwMcEIwM7PMCcHMzAAnBDMzy5wQzMwMcEIwM7PMCcHMzAAnBDMzy5wQzMwMcEIwM7PMCcHMzAAnBDMzy5wQzMwMcEIwM7PMCcHMzAAnBDMzy5wQzMwMcEIwM7PMCcHMzAAnBDMzy5wQzMwMcEIwM7PMCcHMzIAWE4KkZZIWS7pLUk8u20rSAkkP5vst695/uqSlkh6QdHBd+d45zlJJ50hSLt9Q0pW5fJGkcYU/p5mZ9aOdI4R3RsSeETExPz8NWBgRE4CF+TmSdgWmArsBU4DzJI3Iy5wPzAAm5NuUXD4deDoidgLOBs6s/pHMzKyKTpqMDgXm5MdzgMPqyudGxAsR8TCwFNhX0nbA5hFxa0QEcGnDMrVYVwGTa0cPZmY2OFpNCAH8TNIdkmbksm0j4nGAfD86l48BHq1bdnkuG5MfN5b3WiYiVgHPAls3VkLSDEk9knpWrlzZYtXNzKwVI1t839sj4jFJo4EFku7v473N9uyjj/K+luldEDEbmA0wceLENV43M7PqWjpCiIjH8v0K4AfAvsATuRmIfL8iv305sEPd4mOBx3L52CblvZaRNBLYAniq/Y9jZmZV9ZsQJG0iabPaY+DdwD3AfGBafts04Jr8eD4wNY8cGk/qPL4tNys9J2lS7h/4SMMytVhHANfnfgYzMxskrTQZbQv8IPfxjgQuj4ifSLodmCdpOvAIcCRARCyRNA+4F1gFnBgRL+VYJwCXABsB1+UbwIXAZZKWko4Mphb4bGZm1oZ+E0JEPATs0aT8N8DktSwzE5jZpLwH2L1J+fPkhGJmZkPDZyqbmRnghGBmZpkTgpmZAU4IZmaWOSGYmRnghGBmZpkTgpmZAU4IZmaWOSGYmRnghGBmZpkTgpmZAU4IZmaWOSGYmRnghGBmZpkTgpmZAU4IZmaWOSGYmRnghGBmZpkTgpmZAU4IZmaWOSGYmRnghGBmZlnLCUHSCEl3Svphfr6VpAWSHsz3W9a993RJSyU9IOnguvK9JS3Or50jSbl8Q0lX5vJFksYV/IxmZtaCdo4QTgLuq3t+GrAwIiYAC/NzJO0KTAV2A6YA50kakZc5H5gBTMi3Kbl8OvB0ROwEnA2cWenTmJlZZS0lBEljgfcCF9QVHwrMyY/nAIfVlc+NiBci4mFgKbCvpO2AzSPi1ogI4NKGZWqxrgIm144ezMxscLR6hPBPwCnAy3Vl20bE4wD5fnQuHwM8Wve+5blsTH7cWN5rmYhYBTwLbN3qhzAzs871mxAkvQ9YERF3tBiz2Z599FHe1zKNdZkhqUdSz8qVK1usjpmZtaKVI4S3A++XtAyYCxwk6TvAE7kZiHy/Ir9/ObBD3fJjgcdy+dgm5b2WkTQS2AJ4qrEiETE7IiZGxMRRo0a19AHNzKw1/SaEiDg9IsZGxDhSZ/H1EXEsMB+Ylt82DbgmP54PTM0jh8aTOo9vy81Kz0malPsHPtKwTC3WEflvrHGEYGZmA2dkB8vOAuZJmg48AhwJEBFLJM0D7gVWASdGxEt5mROAS4CNgOvyDeBC4DJJS0lHBlM7qJeZmVXQVkKIiBuBG/Pj3wCT1/K+mcDMJuU9wO5Nyp8nJxQzMxsaPlPZzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8v6TQiSXiPpNkm/lLRE0hdz+VaSFkh6MN9vWbfM6ZKWSnpA0sF15XtLWpxfO0eScvmGkq7M5YskjRuAz2pmZn1o5QjhBeCgiNgD2BOYImkScBqwMCImAAvzcyTtCkwFdgOmAOdJGpFjnQ/MAC
bk25RcPh14OiJ2As4Gzuz8o5mZWTtG9veGiAjgd/np+vkWwKHAgbl8DnAjcGounxsRLwAPS1oK7CtpGbB5RNwKIOlS4DDgurzMF3Ksq4B/lqT8t4fMuNN+1NL7ls167wDXxMxs4LXUhyBphKS7gBXAgohYBGwbEY8D5PvR+e1jgEfrFl+ey8bkx43lvZaJiFXAs8DWFT6PmZlV1FJCiIiXImJPYCxpb3/3Pt6uZiH6KO9rmd6BpRmSeiT1rFy5sp9am5lZO9oaZRQRz5CahqYAT0jaDiDfr8hvWw7sULfYWOCxXD62SXmvZSSNBLYAnmry92dHxMSImDhq1Kh2qm5mZv1oZZTRKEmvzY83Av4cuB+YD0zLb5sGXJMfzwem5pFD40mdx7flZqXnJE3Ko4s+0rBMLdYRwPVD3X9gZrau6bdTGdgOmJNHCq0HzIuIH0q6FZgnaTrwCHAkQEQskTQPuBdYBZwYES/lWCcAlwAbkTqTr8vlFwKX5Q7op0ijlMzMbBC1MsrobmCvJuW/ASavZZmZwMwm5T3AGv0PEfE8OaGYmdnQ8JnKZmYGOCGYmVnmhGBmZoATgpmZZU4IZmYGOCGYmVnmhGBmZoATgpmZZU4IZmYGOCGYmVnmhGBmZoATgpmZZU4IZmYGOCGYmVnmhGBmZoATgpmZZU4IZmYGOCGYmVnmhGBmZoATgpmZZU4IZmYGOCGYmVnmhGBmZkALCUHSDpJukHSfpCWSTsrlW0laIOnBfL9l3TKnS1oq6QFJB9eV7y1pcX7tHEnK5RtKujKXL5I0bgA+q5mZ9aGVI4RVwKcj4k+BScCJknYFTgMWRsQEYGF+Tn5tKrAbMAU4T9KIHOt8YAYwId+m5PLpwNMRsRNwNnBmgc9mZmZt6DchRMTjEfGL/Pg54D5gDHAoMCe/bQ5wWH58KDA3Il6IiIeBpcC+krYDNo+IWyMigEsblqnFugqYXDt6MDOzwdFWH0JuytkLWARsGxGPQ0oawOj8tjHAo3WLLc9lY/LjxvJey0TEKuBZYOsmf3+GpB5JPStXrmyn6mZm1o+WE4KkTYGrgZMj4rd9vbVJWfRR3tcyvQsiZkfExIiYOGrUqP6qbGZmbWgpIUhan5QMvhsR38/FT+RmIPL9ily+HNihbvGxwGO5fGyT8l7LSBoJbAE81e6HMTOz6loZZSTgQuC+iPhG3UvzgWn58TTgmrryqXnk0HhS5/FtuVnpOUmTcsyPNCxTi3UEcH3uZzAzs0EysoX3vB34MLBY0l257HPALGCepOnAI8CRABGxRNI84F7SCKUTI+KlvNwJwCXARsB1+QYp4VwmaSnpyGBqZx/LzMza1W9CiIibad7GDzB5LcvMBGY2Ke8Bdm9S/jw5oZiZ2dDwmcpmZgY4IZiZWeaEYGZmgBOCmZllTghmZgY4IZiZWeaEYGZmgBOCmZllTghmZgY4IZiZWeaEYGZmQGuT21kB4077UUvvWzbrvQNcEzOz5nyEYGZmgBOCmZllTghmZgY4IZiZWeaEYGZmgBOCmZllTghmZgY4IZiZWeYT016lfKKbmZXmIwQzMwNaSAiSLpK0QtI9dWVbSVog6cF8v2Xda6dLWirpAUkH15XvLWlxfu0cScrlG0q6MpcvkjSu8Gc0M7MWtHKEcAkwpaHsNGBhREwAFubnSNoVmArslpc5T9KIvMz5wAxgQr7VYk4Hno6InYCzgTOrfhgzM6uu34QQET8HnmooPhSYkx/PAQ6rK58bES9ExMPAUmBfSdsBm0fErRERwKUNy9RiXQVMrh09mJnZ4Knah7BtRDwOkO9H5/IxwKN171uey8bkx43lvZaJiFXAs8DWFetlZmYVle5UbrZnH32U97XMmsGlGZJ6JPWsXLmyYhXNzKyZqgnhidwMRL5fkcuXAzvUvW8s8FguH9ukvNcykkYCW7BmExUAETE7IiZGxMRRo0ZVrLqZmTVTNSHMB6blx9OAa+rKp+aRQ+NJnce35Wal5y
RNyv0DH2lYphbrCOD63M9gZmaDqN8T0yRdARwIbCNpOfB5YBYwT9J04BHgSICIWCJpHnAvsAo4MSJeyqFOII1Y2gi4Lt8ALgQuk7SUdGQwtcgnMzOztvSbECLi6LW8NHkt758JzGxS3gPs3qT8eXJCMTOzoeMzlc3MDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMwAJwQzM8ucEMzMDHBCMDOzzAnBzMyAFq6HYOuGcaf9qKX3LZv13gGuiZkNFR8hmJkZ4IRgZmaZE4KZmQFOCGZmljkhmJkZ4FFGNgA8Ysns1ckJwbqeE4zZ4OiahCBpCvBNYARwQUTMGuIq2TBVOsG0Es/Jyl4NuiIhSBoBfAt4F7AcuF3S/Ii4d2hrZja4fDRkQ6krEgKwL7A0Ih4CkDQXOBRwQjDrQDcfDTn5dR9FxFDXAUlHAFMi4q/z8w8D+0XEJxreNwOYkZ/uDDzQQvhtgCcLVrdkvG6uW+l43Vy30vG6uW7dHq+b61Y63lDVbceIGNXshW45QlCTsjUyVUTMBma3FVjqiYiJVSs2kPG6uW6l43Vz3UrH6+a6dXu8bq5b6XjdWLduOQ9hObBD3fOxwGNDVBczs3VStySE24EJksZL2gCYCswf4jqZma1TuqLJKCJWSfoE8FPSsNOLImJJofBtNTENcrxurlvpeN1ct9Lxurlu3R6vm+tWOl7X1a0rOpXNzGzodUuTkZmZDTEnBDMzA5wQzMwsG5YJQdJWTcrGD0VdBpqk90kq9jtK2qtUrNJK162bP6vZUBiWncqS/hN4T0T8Nj/fFZgXEbtXiPUB4PqIeDY/fy1wYET8W5txzqXJyXY1EfGpduuW434HeCtwNXBxRNxXJU5dvBuA7YDvAXM7Ge1V6rsbiLqVijeAv+sE4KvArsBr6uK9oWK8UcCpTeIdVDHelsCEhlg/rxBnF9I0NWNI3+NjwPxO1uOS352k1wDTgd0aYh3XZpw1dlLrRcRT7dYtxy36uw7LIwTgK8C1kjaVtDfpH/7YirE+X9ugAUTEM8DnK8TpAe4g/WhvAR7Mtz2BlyrWjYg4FtgL+B/gYkm3SpohabOK8d4JHAisBGZLWizp7ytWr9R3NxB1KxVvQH5X4GLgfGAV8E7gUuCyDuJ9F7gPGA98EVhGOv+nbZL+Gvg5aZj4F/P9FyrEORWYS5qp4LZcHwFXSDqtSt2ykt/dZcDrgIOBm0gnzT5XIc4drF5XVgL/TVpPVuayqor9rgBExLC8AYcBtwCLgQkdxLm7SdniDuLdAKxf93x94IYCn3cb4OS8QlxHWtk+2WHMN5H+If7YDd9dyboNwGct+rsCdzR+X8B/FIh3d13ZTRVjLSYlwLvy812AKyvE+e/676yufAPgwW747oA767+3/Lte30Hdvg38Rd3z9wBndcPvGhHdcWJaKU0O3zcHHgI+KYmodvjeI+kbpOm5A/gknWX07YHNgNoh4qa5rBJJhwDHAW8kbdD2jYgVkjYm7Tmc22a8PwU+BBwB/Ia0B/fpitUr+t0VrlvpeEV/V+D53Df0YD5p83+B0R3EezHfPy7pvaSmmbFV6xYRz0tC0oYRcb+knSvEeZn0Hf2qoXy7/FpVJb+72vf2jKTdgV8D4zqo2z4RcXztSURcJ+mMDuKV/F2HV0IgHZLV62TDXfNJ4B+AK/PznwGVmymAWcCduf0a4B1UONyucyRwdjS030bE/0lqq50zuxi4Anh3RHQ6n1T9dyfSd3diB/FK1q10vNK/68nAxsCngDOAg4BpHcT7sqQtSAnvXNLO0v+rGGt57g/6N2CBpKepNvfYycBCSQ8Cj+ay1wM7AZ9Y20Itxi313c3O/SX/QJpOZ9P8uKonc7Pkd0g7SceSdkaqKvm7Ds9O5W4n6XXAfvnpooj4dYF4+5JWsNs7jWfVlP5dc8zNgIiI33UaayBIegewBfCTiPhjheXXI627Y0
g7DctJ63An/S+12F333eXO5c8DB+SinwNfjIqdyqUNy4Qg6e2kvbMdSUdBIq0YVUYZLACOjNQhWhtdMTciDm4zzi750PotzV6PiF+0W7ccdzppBbue9DnfAXwpIi5qM868iDhK0mJ6N7vVvrs3V6jbnwCfIR1iv3I0Gm2OgChdt4H4rDnuGFavc0C1kTc51ptInaG10SlPAtMi4p6K8caS9iD3JzXH3AycFBHLK8R6I7A8Il6QdCDwZuDS2v9IhXiTgCUR8Vx+vhmwa0Qsqhiv2HeX976/APxZLroROCPqBksMJUlfA74M/AH4CbAHcHJEfKdSvGGaEO4nHTbdQd1Ij4ho+9BM0p0RsVd/ZS3EmR0RM+qaFOpFuxvJurgPAG+rfTZJWwO3RERbbbqStouIxyXt2Oz1iGhs520l5i9JnWiNv0NbTXml6zZAn/VMUn/EEla3f0dEvL/dWDneLcDfRcQN+fmBwFci4m0V4y0ALmf1aJtjgWMi4l0VYt0FTCQl+p+SmlJ2joi/qFi3O4G3RN4Y5aOGnohouvPUQrxi352kq4F7gDm56MPAHhFxeMW6FdlJqot3V0TsmYd4H0ba7t0QEXtUiTfc+hBqno2I6wrFelnS6yPiEYC8EWk7i0bEjHz/zkL1qllO72Fwz7G6PbZlEfF4frhJNFzLOv9Dtb2RBFZFxPkVluuldN0G6LMeRtoovlBh2WY2qW3QACLiRkmbdBBvVERcXPf8EkknV4z1cqQZij8A/FNEnJs36lWplgwAIuJlSZ1sm0p+d2+MiA/WPf9iTohVfY+0k3QBnQ1Lrlk/3/8FcEVEPCU1u95Ya4ZrQrhB0teB7wOv/INWbJb5O+BmSTfl5wew+jKebct7zXNJJ8r9Twdx/jY//F9gkaRrSInqUNKY7qrmSboM+BppaOHXSHuDb60Q61pJHwd+QO/foWp7acm6lY73EOmfs1RCeEjSP9B7j/7hDuI9KelYUic6wNFU78x8UdLRpI7aQ3LZ+n28vz8PSfoU6dwBgI+Tvs9O4pX67v4gaf+IuBleaY7+Qwd1K7KTVOfa3CLyB+DjSieqPV812HBtMqrtHdQ+XK1tuOph2TbApBzn1oiofB3UfITxoXx7mTQCZ17tCKSNOH2e4BURX6xYv02AM4G9ScMovwucGRFtDwOU1OyfsFJfTum6lY6Xmxb2ABbSO/lVPVN5S9KJRvuT1rubSJ2PT1eM93rgn0nJLkjn6JxUsXlsV+B40v/CFUrTwnwoImZVrNto4BzSaKAgfYcnRcTKivGKfXeS9iD1R2yRYz1F6o+4u2LdvgCsoNxOUu3z/jYiXsrr9GZVBzQM14TQbGMZEfGlDuN+ISK+0EmMhngTSEPYjomIEQXiva7qilAXYwNgJvAu0hC7v4+IuZ3WrYTSdSsZT1LTYY0RMadZ+XAh6S1VB0TUxZhDSgDP5Odbkk7WqjJsekBI2hwg8nQ4HcQpupPUEHt2rWm6quE6dcXv6m6rgCl0djJJTaUOwkaSxkk6hdR0tAtwSom4wI8LxLiddPi5D2kP62hJV3UaVFKJq0OVrluxeBExp3YjnSE7p1QykNTRBrdJvB8WDHdBgRhvrh+hlPfki0w8WOq7y4ngHwvEGd/k1nEyyCZ2GmBY9iFExFn1zyX9I2Wu0Vy9t2Z1XRaR2lu/RxrO2klb6RrhO1o4je44sdZeSjor81BJH+64Zh2urKXrNsCf9QLSvEaldLzeNRhTMFaJuq0nactak47SWP1S26aS313HG9x6JfboG6zoNMCwTAhNbAyUyMIl/smnRcT9BeI086+dLJxHd3ydhk7ViOhkUrWajlbW0nUb4M9aegP+o8LxOhkR1KhSX1WDs4Bb8tFZAEeRmvJKKPnddbzBbdBxgpH0PuDHEfFyREzpNN6wbDJSmrXy7nxbAjwAfLNirC0knS2pB7hd0ln5ZJWqHpf0DUk9+dZRPEmTlGc2jYjzJG0mab/+luvDzyR9UJ2MXV
tdt1eu1VBiZaVg3QYoXk2JjeQrIqKTqVJeIWkjSTt32jYv6QO1dTYi/k3SayUdVjVeRFwKfBB4gjT75+GFEnOx7y7HKrEO1yuRYKaS5mz6mtLcXB0Zrp3K9SccrQKeiIhVFWOVPjGldLzSJ/U8B2xC+t6eZ/UIrc0rxCp9rYZidWuI9xKpL6GTz1r62g+Hk0ZAjc716vSzHkJqA98gIsZL2pN0Rnvb/WLKJ0M1lN0ZbZ6sWVr+Pfu6NkXL352ka/uJVfWEw1f26Kssv5aYm5OGEX+UVOeLSecktD1N97BMCCWtZeVfo6zL4t0dFadfKK3kyprjbcWaF2a5ae1LDI7SG0lJS4FDOk2idfHuIA3rvLFWp6rrSbPlJC2OiDeVqGunJH2J1Cd0GSmRHkMaivm1NmK8Iz88nHQ9hNpUEEcDyyLicxXrVnQnqS7uNqTzLU4mzXK8E3BORLQ12/G60ofQidInppSOV+SkHq1ljqWaqkMLI+K3+ahoI9LK+gHgs5LaXlmVLsxyEml637tI54bcAkyuUrcc8/2snmjsxoioOgKnWfNrJ/9fT5TaWGSrIuLZQq1jpaeEL+3giKhvNj0/D+ZoOSHUdjIknRERB9S9dK2kSvNT5bjH1u0kXSypsz36wtPfV7qIwrp0I1356pekC88sI3XIvblwvD06iDeaNHx1BakN9nLSNAXtxrkh324lzbFeu7rTi8DNFet2COkEnLuBzwKjc/nGwK8qxCtyYZa6eLNIJ0Edl28LgFkVY10EfCP/Y74BOBu4pIO6fZN00uLRpL3Uw0lt61XjXQj8Zf4tJuQNxbcrxtokf3e1deSrpOkiKtWt9I20k3AMMIKUqI8hze9VJdZ9wBvqno8H7itQxyIXtCKdNHfAWl6b3G48Nxn1Q9KGpAuovBF4LfAsZU5yK3WiS9GTeiTNBWZGxOL8fHfgMxHxVxViXQpcEE1m/JQ0OSIWthnv9ojYR2kumf0izbZ5V1Rvbrsb2DNye66kEaQrZFVpRtmEdJLhn8Mr1374ckT8vmLdLm5SHB38rhuTpmF5dy76aa5f5WkOupWkcaSE+nbSEcx/kmYAXVYh1hRgNquPuscBH4uIn1asW+Me/Zyo26OPiKYTLvYTs9j0904I/ZD0E+AZ4Bf0nrHzrLUt00+8rwBfa9iAfzoqjoZo1k7dYdv1GhvYDje65VZW6QekvoiTSe3hT5MuwVh1ls27SR2/T+XnW5Gajbqi/6UkSX9G2kt+qa6s0lnGKjxjZ7fLO4W75Kf3RwcTGA7ATlKR6e9fieeE0DdJ90TE7gXjNduA/yKqjwr6JWmjVn9Sz01RsYNP0hXA7+l9RadNI+LoCrGKrqwNsTu6MEuOcTSp6eOGXL8DgNOjwvQVpTeSOd75wLYRsbukNwPvj4gvV4z3f6Qzs4+KiCdyWaX1ToWmNS9N0ikR8TWteSldoL15pfIor7WKiO9XqGItdsmdpCLT39e4U7l/t0h6U60JpYARStehfQHS2HBgww7ilT6p56PACaTOW0hXdKo6O+MpwF6NKyupvb0jUWBkUaSJ2W4kTV0h4NQO/jlLT2v8r6R+l38BiIi7JV1OuhhKFQ8AXwdulDQ9Im6h+gl0pWfsLKXWCd94Kd0qDunjtSDNpNy2JjtJ50rqZCepyPT3NU4I/dsf+CulSalegM6uqkXa816Y24iD1J5Yec6biLhU6aS5g3LdDo+GOf7bjPe8pG+Txko/UDVOVnRlLUVrXr2udtWw7SVtX6UZhfIbyY0j4raGUUGVzqXJIiJ+mPcor5R0ERWu65GVnta8iIi4NvcD7R4Rn+0w1kcLVatRkZ0kDdD0904I/XtPyWD5kPZuVnc+nlG1g6ou5r1A5SRQLw/D/DqwATBeFU5gGqiVtaC/JV3T4iyaXEKTlFzbVXoj+aTSpSprJxweATze9yJ9Uq7Pg7k/4WLSpS+rqM3sWr/RDcpMD9ORSFNA710qntIZ2Z9n9dDkm0j/D1UvoV
lqJ2mzfP8/+VZzTcV6Ae5DGFKS3hfVx70PiBInMGmArtVQWm6u+zjpKDCA/wDOrzLyRuWv/fAG0uiWt5E6zx8Gjq0yUqaPv/HKlQCHE0lnkYbWfo/UHwZUa/dXoZkF6naS9gTeRNpwv7KTFBHHt1u3Jn+j8+nvnRCGTiedyQNF0qKI2K++87vdhLCWuB2vrKVJmgf8lnRhHEhj/l8bEUcNXa16y8NZ14uKZ3bXxRkF/A1rdnp3Oq9R6Rk7O1ZyyG6pUXeDsZNUYnviJqOhVXpStRLukfSXpM7vCcCnSG2cnfoxZaeELmHn6H0x8hvyCJqOlNhIStoW+AqwfUS8R+kqZW+NiAsrhryGdAT075Tp9K4pOiV0CYXb/4vMLNBsgz8AO0kdb0+G5WynryIfG+oKNPFJYDdSO/jlpBPxTupzidZ0Y/K7U9Kk2hOlWWL/s0DcEhvJS0gnj22fn/836fyLqjaOiFMjYl5EXF27dVhHKD8ldMckvUHStZJWSloh6Rqly3xWcTzwLUnLJC0jXYa01P9tiQta1eto+ntwQhh0ko5Unq4aOFjS99XPPEKDbNd8G0maJuJQ0vj1TnW8spaiPD06sB9pyO6y3AdwK6s7DztRYiO5TUTMI113m0iz9XayZ/9DSZVO4Guk8tOal3Y5MA/YjpRQv0ea3qWKyaT+g4vy7TJgnzzYolMlppgvOv29+xAGWa09XtL+pCaBs4DPRe/JuIZMHpb4GVJH2itT9Ea1i7FPApbU2r/zirtrRCwqVN1K1Ht69DVU/KxFpzXO50d8EFgQEW/J3+WZEfGOvpdca7zaVN9/JM1PBV0yrXlptX6whrL/iohJa1umj1iXk4745pM24O8l7SDtAnwv2phBtUnsj0fEeVWXzzHKTn/vhDC4ap21kr5Kuvbu5eqCueRrJN0cEfsXilV0Ze1mpTeS+ajxXGB3UnIeBRwREXd3WtcSVHha80J12io/PIU03cxcUt0+BGwYEWdUiPlT4IMR8bv8fFPgKtKsvXdExK5txiu6k7SWTu/Kg0DcqTz4/lfSv5DOQzhTaZ6Ubmq6+7ykC0izgNaPp69yZqaibo8j0mUrh+U6FwWnNc4nV70j33Ym7Zk+EBEv9rlg/3FLTfVNFJzWvKA7SAmg1hRT39YfQNsJAXg96aiq5kVgx4j4g6QqcxqdT+/BFb9vUtaOItPf1wzLf84udxQwBfjHiHhG0nb0PsFnqH2UdDi8PqubjKqeql90Ze12pTaS+eSqQyPibGBJibpJmkWaoqM2xPakPHrmtAqxys7BX0hEVO047svlwH/lkyshTWlxRR4OXOVk0NI7SccD5wB/T/o/XUgaXlyJm4wGSd3hbFMdnNFalApe+UrSaNLKehCrV9aTImJlifjdpMlGsqNpjSXNJE3edyW9T66qdKEilZ3qu+iMnaUpTeVyEXB55FmFO4y3N+nkRZGuDVJ5riRJ3wdupPdO0jsj4rCK8cpOf++EMDjyKJba4WztS68d2kZUPKO1NEn/CpwdHcyHVBer6MrazUpvJCXdkB/WrysR1WdPLTrVtwrO2FmapJ1IR7ofIk10dzHws+iCjV3pnaRm/Y+d9Ek6IQyy3LF6DDA+Ir4k6fXAdkM98qZG0n2kvdyOJ/MrvbJ2u5IbSUmfpnd7eJDOqu6JiLsqxJsKnEmZqb4HbFrzkvL/2vtIe+Mvk44avjmUR+PF9+gLT3/vPoTB9y3SynkQ8CXS5FZXk9p3u0HJceXrSdqyYWUdlutck41kp9Ma703z4Y4fk9TWcMe8YXyZdA3qElN9D9i05qUoXT/iONLklFeT+k72J/0+ew5dzXhzfTNWRDwtqZMdpKLT3w/Lf84ut18eV34nvLJCbDDUlaqJCmPw+1D6Wg3drPRGcmvSkN3acMfPk4Y7HkAaTdPOBeNflvSJSCe6za9Yn3pdOa15jdIEjc+Qrk1xaqy+wtkipaknhlLRnaQoPP29E8LgezF36NXG5o+i7gSw4a
T0ytrlSm8kSw93XCDpM6zZSd1y84m6f1rzmg8DewHjgVOVrykREV+KNmcpHQDFd5Ki4PT3TgiD7xzSnPmj80iSI0hDxoalkitrNxrAjWTp4Y7H5Xp9vKG8ncEMAzIH/wD4Bquvg175+scDodt3ktypPAQk7UKaI0XAwk7ParWhowGc1rjwcMdm1374dkS0PXNnQ9xunNa86HXQ1yVOCGaFdelGckCu/aDuvKbHbODcKHcd9HWGm4zMyltnrv1AgRk7S5G0mHT0MxL4qKSHKHMd9HWGE4JZeV2zkaxzp6RJEfFfQMlrP3TNtOakcw6sA24yMitMBaY1LqVur3l90kR5j+TnOwL3VmlrLz1jp3UPJwSzArp1I6mBufbDOjOt+brGTUZmZZSe1riIwica1qwz05qva7ppHn6zV7M1NpIM3x2uhyR9StL6+XYSw3ha83WJE4JZGevSRvJ44G2kk/GWk65NXXkOfuseTghmZaxLG8mvA8dHxOiI2BY4kTSTqr3KDddDWrPBVttIPgOrpzUmTRkx3JSesdO6hI8QzMpYYyNJmmBtOFovJzxgeE9rvq7xj2hWxjpz7QfWrWnN1ynDdYU1G2zrzEay22fstOp8YppZIZJ2ZfVGcqE3kvZq44RgZmaAO5XNzCxzQjAzM8AJwczMMicEMzMDnBDMzCz7/8m3n1uatRdcAAAAAElFTkSuQmCC",
                        "text/plain": [
                            "<Figure size 432x288 with 1 Axes>"
                        ]
                    },
                    "metadata": {
                        "needs_background": "light"
                    },
                    "output_type": "display_data"
                }
            ],
            "source": [
                "# View distribution of license types in the dataset\n",
                "metadata['license'].value_counts().plot(kind='bar', title='License')"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Extract metadata on public domain articles only (copy so later cleaning does not act on a view of the original)\n",
                "metadata_public = metadata.loc[metadata['license'] == 'cc0'].copy()\n",
                "\n",
                "# Clean dataframe\n",
                "metadata_public = covid_utils.clean_dataframe(metadata_public)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Let's look at the first few rows of this dataframe, which contains metadata on the public domain articles."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Number of articles in dataset: 134206\n",
                        "Number of articles in dataset that fall under the public domain (cc0): 274\n"
                    ]
                },
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>cord_uid</th>\n",
                            "      <th>sha</th>\n",
                            "      <th>source_x</th>\n",
                            "      <th>title</th>\n",
                            "      <th>doi</th>\n",
                            "      <th>pmcid</th>\n",
                            "      <th>pubmed_id</th>\n",
                            "      <th>license</th>\n",
                            "      <th>abstract</th>\n",
                            "      <th>publish_time</th>\n",
                            "      <th>authors</th>\n",
                            "      <th>journal</th>\n",
                            "      <th>mag_id</th>\n",
                            "      <th>who_covidence_id</th>\n",
                            "      <th>arxiv_id</th>\n",
                            "      <th>pdf_json_files</th>\n",
                            "      <th>pmc_json_files</th>\n",
                            "      <th>url</th>\n",
                            "      <th>s2_id</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>a3c3b7c38ad32e1042d78aae2027ca491e9f2197</td>\n",
                            "      <td>PMC</td>\n",
                            "      <td>Understanding the Spatial Clustering of Severe...</td>\n",
                            "      <td>10.1289/ehp.7117</td>\n",
                            "      <td>PMC1247620</td>\n",
                            "      <td>15531441.0</td>\n",
                            "      <td>cc0</td>\n",
                            "      <td>We applied cartographic and geostatistical met...</td>\n",
                            "      <td>2004-07-27</td>\n",
                            "      <td>Lai, P.C.; Wong, C.M.; Hedley, A.J.; Lo, S.V.;...</td>\n",
                            "      <td>Environ Health Perspect</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>document_parses/pdf_json/a3c3b7c38ad32e1042d78...</td>\n",
                            "      <td>document_parses/pmc_json/PMC1247620.xml.json</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>NaN</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>9mzs5dl4</td>\n",
                            "      <td>c1c6a98c21304f3788b20870b34afd8a115fa38c</td>\n",
                            "      <td>PMC</td>\n",
                            "      <td>The Application of the Haddon Matrix to Public...</td>\n",
                            "      <td>10.1289/ehp.7491</td>\n",
                            "      <td>PMC1257548</td>\n",
                            "      <td>15866764.0</td>\n",
                            "      <td>cc0</td>\n",
                            "      <td>State and local health departments continue to...</td>\n",
                            "      <td>2005-02-02</td>\n",
                            "      <td>Barnett, Daniel J.; Balicer, Ran D.; Blodgett,...</td>\n",
                            "      <td>Environ Health Perspect</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>document_parses/pdf_json/c1c6a98c21304f3788b20...</td>\n",
                            "      <td>document_parses/pmc_json/PMC1257548.xml.json</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>NaN</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>u7lz3spe</td>\n",
                            "      <td>c56ffdaf1cfbae5a6ed0abea495eaf7fa1cbc031</td>\n",
                            "      <td>PMC</td>\n",
                            "      <td>Cynomolgus Macaque as an Animal Model for Seve...</td>\n",
                            "      <td>10.1371/journal.pmed.0030149</td>\n",
                            "      <td>PMC1435788</td>\n",
                            "      <td>16605302.0</td>\n",
                            "      <td>cc0</td>\n",
                            "      <td>BACKGROUND: The emergence of severe acute resp...</td>\n",
                            "      <td>2006-04-18</td>\n",
                            "      <td>Lawler, James V; Endy, Timothy P; Hensley, Lis...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>document_parses/pdf_json/c56ffdaf1cfbae5a6ed0a...</td>\n",
                            "      <td>document_parses/pmc_json/PMC1435788.xml.json</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>NaN</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>na7z92i8</td>\n",
                            "      <td>f38f3b112e4b702b60ba56be806d418bbb2b83c3</td>\n",
                            "      <td>PMC</td>\n",
                            "      <td>Immune Protection of Nonhuman Primates against...</td>\n",
                            "      <td>10.1371/journal.pmed.0030177</td>\n",
                            "      <td>PMC1459482</td>\n",
                            "      <td>16683867.0</td>\n",
                            "      <td>cc0</td>\n",
                            "      <td>BACKGROUND: Ebola virus causes a hemorrhagic f...</td>\n",
                            "      <td>2006-05-16</td>\n",
                            "      <td>Sullivan, Nancy J; Geisbert, Thomas W; Geisber...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>document_parses/pdf_json/f38f3b112e4b702b60ba5...</td>\n",
                            "      <td>document_parses/pmc_json/PMC1459482.xml.json</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>NaN</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>j35w1vsw</td>\n",
                            "      <td>5dc0b8b662824323881c3a1ae3a1bae2a821484d</td>\n",
                            "      <td>PMC</td>\n",
                            "      <td>SARS: Systematic Review of Treatment Effects</td>\n",
                            "      <td>10.1371/journal.pmed.0030343</td>\n",
                            "      <td>PMC1564166</td>\n",
                            "      <td>16968120.0</td>\n",
                            "      <td>cc0</td>\n",
                            "      <td>BACKGROUND: The SARS outbreak of 2002–2003 pre...</td>\n",
                            "      <td>2006-09-12</td>\n",
                            "      <td>Stockman, Lauren J; Bellamy, Richard; Garner, ...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>document_parses/pdf_json/5dc0b8b662824323881c3...</td>\n",
                            "      <td>document_parses/pmc_json/PMC1564166.xml.json</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>NaN</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   cord_uid                                       sha source_x  \\\n",
                            "0  ej795nks  a3c3b7c38ad32e1042d78aae2027ca491e9f2197      PMC   \n",
                            "1  9mzs5dl4  c1c6a98c21304f3788b20870b34afd8a115fa38c      PMC   \n",
                            "2  u7lz3spe  c56ffdaf1cfbae5a6ed0abea495eaf7fa1cbc031      PMC   \n",
                            "3  na7z92i8  f38f3b112e4b702b60ba56be806d418bbb2b83c3      PMC   \n",
                            "4  j35w1vsw  5dc0b8b662824323881c3a1ae3a1bae2a821484d      PMC   \n",
                            "\n",
                            "                                               title  \\\n",
                            "0  Understanding the Spatial Clustering of Severe...   \n",
                            "1  The Application of the Haddon Matrix to Public...   \n",
                            "2  Cynomolgus Macaque as an Animal Model for Seve...   \n",
                            "3  Immune Protection of Nonhuman Primates against...   \n",
                            "4       SARS: Systematic Review of Treatment Effects   \n",
                            "\n",
                            "                            doi       pmcid   pubmed_id license  \\\n",
                            "0              10.1289/ehp.7117  PMC1247620  15531441.0     cc0   \n",
                            "1              10.1289/ehp.7491  PMC1257548  15866764.0     cc0   \n",
                            "2  10.1371/journal.pmed.0030149  PMC1435788  16605302.0     cc0   \n",
                            "3  10.1371/journal.pmed.0030177  PMC1459482  16683867.0     cc0   \n",
                            "4  10.1371/journal.pmed.0030343  PMC1564166  16968120.0     cc0   \n",
                            "\n",
                            "                                            abstract publish_time  \\\n",
                            "0  We applied cartographic and geostatistical met...   2004-07-27   \n",
                            "1  State and local health departments continue to...   2005-02-02   \n",
                            "2  BACKGROUND: The emergence of severe acute resp...   2006-04-18   \n",
                            "3  BACKGROUND: Ebola virus causes a hemorrhagic f...   2006-05-16   \n",
                            "4  BACKGROUND: The SARS outbreak of 2002–2003 pre...   2006-09-12   \n",
                            "\n",
                            "                                             authors                  journal  \\\n",
                            "0  Lai, P.C.; Wong, C.M.; Hedley, A.J.; Lo, S.V.;...  Environ Health Perspect   \n",
                            "1  Barnett, Daniel J.; Balicer, Ran D.; Blodgett,...  Environ Health Perspect   \n",
                            "2  Lawler, James V; Endy, Timothy P; Hensley, Lis...                 PLoS Med   \n",
                            "3  Sullivan, Nancy J; Geisbert, Thomas W; Geisber...                 PLoS Med   \n",
                            "4  Stockman, Lauren J; Bellamy, Richard; Garner, ...                 PLoS Med   \n",
                            "\n",
                            "   mag_id who_covidence_id arxiv_id  \\\n",
                            "0     NaN              NaN      NaN   \n",
                            "1     NaN              NaN      NaN   \n",
                            "2     NaN              NaN      NaN   \n",
                            "3     NaN              NaN      NaN   \n",
                            "4     NaN              NaN      NaN   \n",
                            "\n",
                            "                                      pdf_json_files  \\\n",
                            "0  document_parses/pdf_json/a3c3b7c38ad32e1042d78...   \n",
                            "1  document_parses/pdf_json/c1c6a98c21304f3788b20...   \n",
                            "2  document_parses/pdf_json/c56ffdaf1cfbae5a6ed0a...   \n",
                            "3  document_parses/pdf_json/f38f3b112e4b702b60ba5...   \n",
                            "4  document_parses/pdf_json/5dc0b8b662824323881c3...   \n",
                            "\n",
                            "                                 pmc_json_files  \\\n",
                            "0  document_parses/pmc_json/PMC1247620.xml.json   \n",
                            "1  document_parses/pmc_json/PMC1257548.xml.json   \n",
                            "2  document_parses/pmc_json/PMC1435788.xml.json   \n",
                            "3  document_parses/pmc_json/PMC1459482.xml.json   \n",
                            "4  document_parses/pmc_json/PMC1564166.xml.json   \n",
                            "\n",
                            "                                                 url  s2_id  \n",
                            "0  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...    NaN  \n",
                            "1  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...    NaN  \n",
                            "2  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...    NaN  \n",
                            "3  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...    NaN  \n",
                            "4  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...    NaN  "
                        ]
                    },
                    "execution_count": 5,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# Preview metadata for public domain articles\n",
                "print(f'Number of articles in dataset: {len(metadata)}')\n",
                "print(f'Number of articles in dataset that fall under the public domain (cc0): {len(metadata_public)}')\n",
                "metadata_public.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 3. Retrieve full article text\n",
                "Now that we have the metadata for the public domain articles as its own dataframe, let's retrieve the full text for each public domain scientific article."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Extract text from all public domain articles (may take 2-3 min)\n",
                "all_text = covid_utils.get_public_domain_text(df=metadata_public, container_name=container_name, azure_storage_sas_token=sas_token)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Notice that **all_text** keeps a subset of the columns from **metadata_public** and adds a new column called **full_text**, which contains the full text of each respective article."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>index</th>\n",
                            "      <th>cord_uid</th>\n",
                            "      <th>doi</th>\n",
                            "      <th>title</th>\n",
                            "      <th>publish_time</th>\n",
                            "      <th>authors</th>\n",
                            "      <th>journal</th>\n",
                            "      <th>url</th>\n",
                            "      <th>abstract</th>\n",
                            "      <th>full_text</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>0</td>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>10.1289/ehp.7117</td>\n",
                            "      <td>Understanding the Spatial Clustering of Severe...</td>\n",
                            "      <td>2004-07-27</td>\n",
                            "      <td>Lai, P.C.; Wong, C.M.; Hedley, A.J.; Lo, S.V.;...</td>\n",
                            "      <td>Environ Health Perspect</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>We applied cartographic and geostatistical met...</td>\n",
                            "      <td>Since the emergence and rapid spread of the et...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1</td>\n",
                            "      <td>9mzs5dl4</td>\n",
                            "      <td>10.1289/ehp.7491</td>\n",
                            "      <td>The Application of the Haddon Matrix to Public...</td>\n",
                            "      <td>2005-02-02</td>\n",
                            "      <td>Barnett, Daniel J.; Balicer, Ran D.; Blodgett,...</td>\n",
                            "      <td>Environ Health Perspect</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>State and local health departments continue to...</td>\n",
                            "      <td>sudden fever and dry cough, along with chills ...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>2</td>\n",
                            "      <td>u7lz3spe</td>\n",
                            "      <td>10.1371/journal.pmed.0030149</td>\n",
                            "      <td>Cynomolgus Macaque as an Animal Model for Seve...</td>\n",
                            "      <td>2006-04-18</td>\n",
                            "      <td>Lawler, James V; Endy, Timothy P; Hensley, Lis...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>BACKGROUND: The emergence of severe acute resp...</td>\n",
                            "      <td>The emergence of severe acute respiratory synd...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>3</td>\n",
                            "      <td>na7z92i8</td>\n",
                            "      <td>10.1371/journal.pmed.0030177</td>\n",
                            "      <td>Immune Protection of Nonhuman Primates against...</td>\n",
                            "      <td>2006-05-16</td>\n",
                            "      <td>Sullivan, Nancy J; Geisbert, Thomas W; Geisber...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>BACKGROUND: Ebola virus causes a hemorrhagic f...</td>\n",
                            "      <td>Background Ebola virus causes a hemorrhagic fe...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>4</td>\n",
                            "      <td>j35w1vsw</td>\n",
                            "      <td>10.1371/journal.pmed.0030343</td>\n",
                            "      <td>SARS: Systematic Review of Treatment Effects</td>\n",
                            "      <td>2006-09-12</td>\n",
                            "      <td>Stockman, Lauren J; Bellamy, Richard; Garner, ...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>BACKGROUND: The SARS outbreak of 2002–2003 pre...</td>\n",
                            "      <td>The SARS outbreak of 2002-2003 presented clini...</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   index  cord_uid                           doi  \\\n",
                            "0      0  ej795nks              10.1289/ehp.7117   \n",
                            "1      1  9mzs5dl4              10.1289/ehp.7491   \n",
                            "2      2  u7lz3spe  10.1371/journal.pmed.0030149   \n",
                            "3      3  na7z92i8  10.1371/journal.pmed.0030177   \n",
                            "4      4  j35w1vsw  10.1371/journal.pmed.0030343   \n",
                            "\n",
                            "                                               title publish_time  \\\n",
                            "0  Understanding the Spatial Clustering of Severe...   2004-07-27   \n",
                            "1  The Application of the Haddon Matrix to Public...   2005-02-02   \n",
                            "2  Cynomolgus Macaque as an Animal Model for Seve...   2006-04-18   \n",
                            "3  Immune Protection of Nonhuman Primates against...   2006-05-16   \n",
                            "4       SARS: Systematic Review of Treatment Effects   2006-09-12   \n",
                            "\n",
                            "                                             authors                  journal  \\\n",
                            "0  Lai, P.C.; Wong, C.M.; Hedley, A.J.; Lo, S.V.;...  Environ Health Perspect   \n",
                            "1  Barnett, Daniel J.; Balicer, Ran D.; Blodgett,...  Environ Health Perspect   \n",
                            "2  Lawler, James V; Endy, Timothy P; Hensley, Lis...                 PLoS Med   \n",
                            "3  Sullivan, Nancy J; Geisbert, Thomas W; Geisber...                 PLoS Med   \n",
                            "4  Stockman, Lauren J; Bellamy, Richard; Garner, ...                 PLoS Med   \n",
                            "\n",
                            "                                                 url  \\\n",
                            "0  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "1  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "2  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "3  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "4  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "\n",
                            "                                            abstract  \\\n",
                            "0  We applied cartographic and geostatistical met...   \n",
                            "1  State and local health departments continue to...   \n",
                            "2  BACKGROUND: The emergence of severe acute resp...   \n",
                            "3  BACKGROUND: Ebola virus causes a hemorrhagic f...   \n",
                            "4  BACKGROUND: The SARS outbreak of 2002–2003 pre...   \n",
                            "\n",
                            "                                           full_text  \n",
                            "0  Since the emergence and rapid spread of the et...  \n",
                            "1  sudden fever and dry cough, along with chills ...  \n",
                            "2  The emergence of severe acute respiratory synd...  \n",
                            "3  Background Ebola virus causes a hemorrhagic fe...  \n",
                            "4  The SARS outbreak of 2002-2003 presented clini...  "
                        ]
                    },
                    "execution_count": 7,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# Preview\n",
                "all_text.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 4. Instantiate the recommender\n",
                "All methods for data preparation and recommendation are contained within the **TfidfRecommender** class we have imported. Before calling these methods, we must create an instance of this class.\n",
                "\n",
                "Select one of the following tokenization methods to use in the model:\n",
                "\n",
                "| tokenization_method | Description                                                                                                                      |\n",
                "|:--------------------|:---------------------------------------------------------------------------------------------------------------------------------|\n",
                "| 'none'              | No tokenization is applied. Each word is considered a token.                                                                     |\n",
                "| 'nltk'              | Simple stemming is applied using NLTK.                                                                                           |\n",
                "| 'bert'              | HuggingFace BERT word tokenization ('bert-base-cased') is applied.                                                               |\n",
                "| 'scibert'           | SciBERT word tokenization ('allenai/scibert_scivocab_cased') is applied.<br>This is recommended for scientific journal articles. |"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Create the recommender object\n",
                "recommender = TfidfRecommender(id_col='cord_uid', tokenization_method='scibert')"
            ]
        },
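        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The tokenization method determines what counts as a term when TF-IDF weights are computed. The following is a minimal sketch of that idea, not the **TfidfRecommender** implementation: it assumes whitespace tokens (similar in spirit to the 'none' option) and a plain tf × log(N / df) weighting, whereas the library's actual weighting may differ (for example in smoothing or normalization). The names **docs**, **doc_freq**, and **tfidf_weights** are hypothetical and used only for illustration.\n",
                "\n",
                "```python\n",
                "import math\n",
                "from collections import Counter\n",
                "\n",
                "# Toy corpus (illustrative only, not taken from the dataset)\n",
                "docs = ['sars virus outbreak', 'ebola virus vaccine', 'sars treatment review']\n",
                "\n",
                "# Whitespace tokenization: each word is a token\n",
                "tokenized = [doc.split() for doc in docs]\n",
                "\n",
                "# Document frequency: number of documents that contain each token\n",
                "doc_freq = Counter(tok for toks in tokenized for tok in set(toks))\n",
                "\n",
                "def tfidf_weights(tokens):\n",
                "    # Term frequency times log-scaled inverse document frequency\n",
                "    counts = Counter(tokens)\n",
                "    return {\n",
                "        tok: (n / len(tokens)) * math.log(len(docs) / doc_freq[tok])\n",
                "        for tok, n in counts.items()\n",
                "    }\n",
                "\n",
                "weights = tfidf_weights(tokenized[0])\n",
                "# 'outbreak' occurs in one document, so it outweighs 'sars' and 'virus', which occur in two\n",
                "print(max(weights, key=weights.get))\n",
                "```\n",
                "\n",
                "A different tokenizer changes the keys of **weights** (for example, subword pieces instead of whole words), which in turn changes which documents appear similar."
            ]
        },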
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 5. Prepare text for use in the TF-IDF recommender\n",
                "The raw text retrieved for each article requires basic cleaning prior to being used in the TF-IDF model.\n",
                "\n",
                "Let's look at the full_text from the first article in our dataframe as an example."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Since the emergence and rapid spread of the etiologic agent of severe acute respiratory syndrome (SARS)-SARS coronavirus (SARS-CoV)-in late 2002 and during the first 6 months of 2003, great progress has been made in understanding the biology, pathogenesis, and epidemiology of both the disease and the virus (SARS-CoV). Much remains to be done, however, including the development of effective therapeutic interventions and diagnostic tools with high sensitivity and specificity soon after the onset of clinical symptoms. The evaluation of key epidemiologic parameters and the impact of different public health interventions in the various settings that experienced minor or major epidemics is also needed (Affonso et al. 2004; Cui et al. 2003; Lau et al. 2004; Leung et al., in press) . In terms of outbreak control on the population level, many questions about \"superspreading events\" (SSEs) remain to be investigated. Such an SSE was responsible for > 300 cases (out of a total of 1,755) in the Amo\n"
                    ]
                }
            ],
            "source": [
                "# Preview the first 1000 characters of the full scientific text from one example\n",
                "print(all_text['full_text'][0][:1000])"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "As seen above, the raw text contains punctuation, and it may also contain special characters (such as • ▲ ■ ≥ °); both should be removed prior to using the text as input. Casing (capitalization) is preserved for [BERT-based tokenization methods](https://huggingface.co/transformers/model_doc/bert.html), but the text is lowercased for simple or no tokenization.\n",
                "\n",
                "Let's join together the **title**, **abstract**, and **full_text** columns and clean them for future use in the TF-IDF model."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Assign columns to clean and combine\n",
                "cols_to_clean = ['title','abstract','full_text']\n",
                "clean_col = 'cleaned_text'\n",
                "df_clean = recommender.clean_dataframe(all_text, cols_to_clean, clean_col)"
            ]
        },
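        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To make the cleaning step more concrete, here is a minimal sketch of the kind of processing described above. It is not the **clean_dataframe** implementation: the helper **clean_text** and its **keep_case** flag are hypothetical names, the example row is made up, and the exact set of characters the library removes may differ.\n",
                "\n",
                "```python\n",
                "def clean_text(text, keep_case=True):\n",
                "    # Hypothetical helper, not part of the recommenders library.\n",
                "    # Replace every character that is neither alphanumeric nor whitespace with a space\n",
                "    text = ''.join(ch if ch.isalnum() or ch.isspace() else ' ' for ch in text)\n",
                "    # Collapse the runs of whitespace left behind by the replacements\n",
                "    text = ' '.join(text.split())\n",
                "    # Cased BERT-style tokenizers keep capitalization; otherwise lowercase\n",
                "    return text if keep_case else text.lower()\n",
                "\n",
                "# Made-up example row with two text columns\n",
                "row = {'title': 'SARS: Systematic Review', 'abstract': 'Fever ≥ 38 °C (n = 12).'}\n",
                "\n",
                "# Join the selected columns into one string, then clean it\n",
                "combined = ' '.join(row[col] for col in ['title', 'abstract'])\n",
                "print(clean_text(combined))                   # SARS Systematic Review Fever 38 C n 12\n",
                "print(clean_text(combined, keep_case=False))  # sars systematic review fever 38 c n 12\n",
                "```\n",
                "\n",
                "Passing **keep_case=False** in this sketch mirrors the lowercasing applied for simple or no tokenization."
            ]
        },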
        {
            "cell_type": "code",
            "execution_count": 11,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>index</th>\n",
                            "      <th>cord_uid</th>\n",
                            "      <th>doi</th>\n",
                            "      <th>title</th>\n",
                            "      <th>publish_time</th>\n",
                            "      <th>authors</th>\n",
                            "      <th>journal</th>\n",
                            "      <th>url</th>\n",
                            "      <th>abstract</th>\n",
                            "      <th>full_text</th>\n",
                            "      <th>cleaned_text</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>0</td>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>10.1289/ehp.7117</td>\n",
                            "      <td>Understanding the Spatial Clustering of Severe...</td>\n",
                            "      <td>2004-07-27</td>\n",
                            "      <td>Lai, P.C.; Wong, C.M.; Hedley, A.J.; Lo, S.V.;...</td>\n",
                            "      <td>Environ Health Perspect</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>We applied cartographic and geostatistical met...</td>\n",
                            "      <td>Since the emergence and rapid spread of the et...</td>\n",
                            "      <td>Understanding the Spatial Clustering of Severe...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1</td>\n",
                            "      <td>9mzs5dl4</td>\n",
                            "      <td>10.1289/ehp.7491</td>\n",
                            "      <td>The Application of the Haddon Matrix to Public...</td>\n",
                            "      <td>2005-02-02</td>\n",
                            "      <td>Barnett, Daniel J.; Balicer, Ran D.; Blodgett,...</td>\n",
                            "      <td>Environ Health Perspect</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>State and local health departments continue to...</td>\n",
                            "      <td>sudden fever and dry cough, along with chills ...</td>\n",
                            "      <td>The Application of the Haddon Matrix to Public...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>2</td>\n",
                            "      <td>u7lz3spe</td>\n",
                            "      <td>10.1371/journal.pmed.0030149</td>\n",
                            "      <td>Cynomolgus Macaque as an Animal Model for Seve...</td>\n",
                            "      <td>2006-04-18</td>\n",
                            "      <td>Lawler, James V; Endy, Timothy P; Hensley, Lis...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>BACKGROUND: The emergence of severe acute resp...</td>\n",
                            "      <td>The emergence of severe acute respiratory synd...</td>\n",
                            "      <td>Cynomolgus Macaque as an Animal Model for Seve...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>3</td>\n",
                            "      <td>na7z92i8</td>\n",
                            "      <td>10.1371/journal.pmed.0030177</td>\n",
                            "      <td>Immune Protection of Nonhuman Primates against...</td>\n",
                            "      <td>2006-05-16</td>\n",
                            "      <td>Sullivan, Nancy J; Geisbert, Thomas W; Geisber...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>BACKGROUND: Ebola virus causes a hemorrhagic f...</td>\n",
                            "      <td>Background Ebola virus causes a hemorrhagic fe...</td>\n",
                            "      <td>Immune Protection of Nonhuman Primates against...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>4</td>\n",
                            "      <td>j35w1vsw</td>\n",
                            "      <td>10.1371/journal.pmed.0030343</td>\n",
                            "      <td>SARS: Systematic Review of Treatment Effects</td>\n",
                            "      <td>2006-09-12</td>\n",
                            "      <td>Stockman, Lauren J; Bellamy, Richard; Garner, ...</td>\n",
                            "      <td>PLoS Med</td>\n",
                            "      <td>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...</td>\n",
                            "      <td>BACKGROUND: The SARS outbreak of 2002–2003 pre...</td>\n",
                            "      <td>The SARS outbreak of 2002-2003 presented clini...</td>\n",
                            "      <td>SARS Systematic Review of Treatment Effects BA...</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   index  cord_uid                           doi  \\\n",
                            "0      0  ej795nks              10.1289/ehp.7117   \n",
                            "1      1  9mzs5dl4              10.1289/ehp.7491   \n",
                            "2      2  u7lz3spe  10.1371/journal.pmed.0030149   \n",
                            "3      3  na7z92i8  10.1371/journal.pmed.0030177   \n",
                            "4      4  j35w1vsw  10.1371/journal.pmed.0030343   \n",
                            "\n",
                            "                                               title publish_time  \\\n",
                            "0  Understanding the Spatial Clustering of Severe...   2004-07-27   \n",
                            "1  The Application of the Haddon Matrix to Public...   2005-02-02   \n",
                            "2  Cynomolgus Macaque as an Animal Model for Seve...   2006-04-18   \n",
                            "3  Immune Protection of Nonhuman Primates against...   2006-05-16   \n",
                            "4       SARS: Systematic Review of Treatment Effects   2006-09-12   \n",
                            "\n",
                            "                                             authors                  journal  \\\n",
                            "0  Lai, P.C.; Wong, C.M.; Hedley, A.J.; Lo, S.V.;...  Environ Health Perspect   \n",
                            "1  Barnett, Daniel J.; Balicer, Ran D.; Blodgett,...  Environ Health Perspect   \n",
                            "2  Lawler, James V; Endy, Timothy P; Hensley, Lis...                 PLoS Med   \n",
                            "3  Sullivan, Nancy J; Geisbert, Thomas W; Geisber...                 PLoS Med   \n",
                            "4  Stockman, Lauren J; Bellamy, Richard; Garner, ...                 PLoS Med   \n",
                            "\n",
                            "                                                 url  \\\n",
                            "0  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "1  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "2  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "3  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "4  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   \n",
                            "\n",
                            "                                            abstract  \\\n",
                            "0  We applied cartographic and geostatistical met...   \n",
                            "1  State and local health departments continue to...   \n",
                            "2  BACKGROUND: The emergence of severe acute resp...   \n",
                            "3  BACKGROUND: Ebola virus causes a hemorrhagic f...   \n",
                            "4  BACKGROUND: The SARS outbreak of 2002–2003 pre...   \n",
                            "\n",
                            "                                           full_text  \\\n",
                            "0  Since the emergence and rapid spread of the et...   \n",
                            "1  sudden fever and dry cough, along with chills ...   \n",
                            "2  The emergence of severe acute respiratory synd...   \n",
                            "3  Background Ebola virus causes a hemorrhagic fe...   \n",
                            "4  The SARS outbreak of 2002-2003 presented clini...   \n",
                            "\n",
                            "                                        cleaned_text  \n",
                            "0  Understanding the Spatial Clustering of Severe...  \n",
                            "1  The Application of the Haddon Matrix to Public...  \n",
                            "2  Cynomolgus Macaque as an Animal Model for Seve...  \n",
                            "3  Immune Protection of Nonhuman Primates against...  \n",
                            "4  SARS Systematic Review of Treatment Effects BA...  "
                        ]
                    },
                    "execution_count": 11,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# Preview the dataframe with the cleaned text\n",
                "df_clean.head()"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 12,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Understanding the Spatial Clustering of Severe Acute Respiratory Syndrome SARS in Hong Kong We applied cartographic and geostatistical methods in analyzing the patterns of disease spread during the 2003 severe acute respiratory syndrome SARS outbreak in Hong Kong using geographic information system GIS technology We analyzed an integrated database that contained clinical and personal details on all 1755 patients confirmed to have SARS from 15 February to 22 June 2003 Elementary mapping of disease occurrences in space and time simultaneously revealed the geographic extent of spread throughout the territory Statistical surfaces created by the kernel method confirmed that SARS cases were highly clustered and identified distinct disease hot spots Contextual analysis of mean and standard deviation of different density classes indicated that the period from day 1 18 February through day 16 6 March was the prodrome of the epidemic whereas days 86 15 May to 106 4 June marked the declining phas\n"
                    ]
                }
            ],
            "source": [
                "# Preview the first 1000 characters of the cleaned version of the previous example\n",
                "print(df_clean[clean_col][0][:1000])"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Let's also tokenize the cleaned text for use in the TF-IDF model. The tokens are stored within our `TfidfRecommender` object."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 13,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Tokenize text with tokenization_method specified in class instantiation\n",
                "tf, vectors_tokenized = recommender.tokenize_text(df_clean, text_col=clean_col)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 6. Recommend articles using TF-IDF\n",
                "Let's now fit the recommender model to the processed data (tokens) and retrieve the top k recommended articles.\n",
                "\n",
                "When creating our object, we specified k=5, so the `recommend_top_k_items` method returns the top 5 recommendations for each public domain article."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 14,
            "metadata": {
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "# Fit the TF-IDF vectorizer\n",
                "recommender.fit(tf, vectors_tokenized)\n",
                "\n",
                "# Get recommendations\n",
                "top_k_recommendations = recommender.recommend_top_k_items(df_clean, k=5)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "In our recommendation table, each row represents a single recommendation.\n",
                "\n",
                "- **cord_uid** corresponds to the query article, i.e., the article for which recommendations are made.\n",
                "- **rec_rank** contains the recommendation's rank (e.g., a rank of 1 means the top recommendation).\n",
                "- **rec_score** is the cosine similarity score between the query article and the recommended article.\n",
                "- **rec_cord_uid** corresponds to the recommended article."
            ]
        },
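        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To make `rec_score` more concrete, the cell below is a minimal sketch, assuming scikit-learn is installed, that computes TF-IDF cosine similarity on three made-up sentences. The names `toy_docs`, `toy_vectors` and `toy_scores` are illustrative only, and this is not necessarily how `TfidfRecommender` computes its scores internally."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Illustrative sketch only: toy sentences, not taken from the dataset\n",
                "from sklearn.feature_extraction.text import TfidfVectorizer\n",
                "from sklearn.metrics.pairwise import cosine_similarity\n",
                "\n",
                "toy_docs = [\n",
                "    'sars outbreak in hong kong',\n",
                "    'sars treatment review',\n",
                "    'porcine epidemic diarrhea virus outbreak',\n",
                "]\n",
                "\n",
                "# Each row of toy_vectors is the TF-IDF vector of one toy document\n",
                "toy_vectors = TfidfVectorizer().fit_transform(toy_docs)\n",
                "\n",
                "# Cosine similarity between the first document (the query) and every document\n",
                "toy_scores = cosine_similarity(toy_vectors[0], toy_vectors).ravel()\n",
                "print(toy_scores)"
            ]
        },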
        {
            "cell_type": "code",
            "execution_count": 15,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>cord_uid</th>\n",
                            "      <th>rec_rank</th>\n",
                            "      <th>rec_score</th>\n",
                            "      <th>rec_cord_uid</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>1</td>\n",
                            "      <td>0.142033</td>\n",
                            "      <td>u7lz3spe</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>2</td>\n",
                            "      <td>0.117743</td>\n",
                            "      <td>j35w1vsw</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>3</td>\n",
                            "      <td>0.100325</td>\n",
                            "      <td>nt60lv2k</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>4</td>\n",
                            "      <td>0.076779</td>\n",
                            "      <td>vp9d9vmp</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>ej795nks</td>\n",
                            "      <td>5</td>\n",
                            "      <td>0.074392</td>\n",
                            "      <td>05d1mhkq</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>...</th>\n",
                            "      <td>...</td>\n",
                            "      <td>...</td>\n",
                            "      <td>...</td>\n",
                            "      <td>...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1280</th>\n",
                            "      <td>yetdnv6j</td>\n",
                            "      <td>1</td>\n",
                            "      <td>0.048499</td>\n",
                            "      <td>9w9w0z4o</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1281</th>\n",
                            "      <td>yetdnv6j</td>\n",
                            "      <td>2</td>\n",
                            "      <td>0.046675</td>\n",
                            "      <td>6nas74q1</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1282</th>\n",
                            "      <td>yetdnv6j</td>\n",
                            "      <td>3</td>\n",
                            "      <td>0.044476</td>\n",
                            "      <td>7docv0dt</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1283</th>\n",
                            "      <td>yetdnv6j</td>\n",
                            "      <td>4</td>\n",
                            "      <td>0.040522</td>\n",
                            "      <td>oj60pldq</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1284</th>\n",
                            "      <td>yetdnv6j</td>\n",
                            "      <td>5</td>\n",
                            "      <td>0.039635</td>\n",
                            "      <td>jq1xumrh</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "<p>1285 rows × 4 columns</p>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "      cord_uid  rec_rank  rec_score rec_cord_uid\n",
                            "0     ej795nks         1   0.142033     u7lz3spe\n",
                            "1     ej795nks         2   0.117743     j35w1vsw\n",
                            "2     ej795nks         3   0.100325     nt60lv2k\n",
                            "3     ej795nks         4   0.076779     vp9d9vmp\n",
                            "4     ej795nks         5   0.074392     05d1mhkq\n",
                            "...        ...       ...        ...          ...\n",
                            "1280  yetdnv6j         1   0.048499     9w9w0z4o\n",
                            "1281  yetdnv6j         2   0.046675     6nas74q1\n",
                            "1282  yetdnv6j         3   0.044476     7docv0dt\n",
                            "1283  yetdnv6j         4   0.040522     oj60pldq\n",
                            "1284  yetdnv6j         5   0.039635     jq1xumrh\n",
                            "\n",
                            "[1285 rows x 4 columns]"
                        ]
                    },
                    "execution_count": 15,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# Preview the recommendations\n",
                "top_k_recommendations"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Optionally, we can access the full recommendation dictionary, which contains full ranked lists for each public domain article."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 16,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Number of recommended articles for ej795nks: 256\n"
                    ]
                }
            ],
            "source": [
                "# Optionally view full recommendation list\n",
                "full_rec_list = recommender.recommendations\n",
                "\n",
                "article_of_interest = 'ej795nks'\n",
                "print(f'Number of recommended articles for {article_of_interest}: {len(full_rec_list[article_of_interest])}')"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Optionally, we can also view the tokens and stop words that were used by the recommender."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 17,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "['understanding', 'spatial', 'clustering', 'severe', 'acute', 'respiratory', 'syndrome', 'sa', 'rs', 'hon']\n"
                    ]
                }
            ],
            "source": [
                "# Optionally view tokens\n",
                "tokens = recommender.get_tokens()\n",
                "\n",
                "# Preview 10 tokens\n",
                "print(list(tokens.keys())[:10])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 18,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost']\n"
                    ]
                }
            ],
            "source": [
                "# Preview just the first 10 stop words sorted alphabetically\n",
                "stop_words = list(recommender.get_stop_words())\n",
                "stop_words.sort()\n",
                "print(stop_words[:10])"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### 7. Display top recommendations for article of interest\n",
                "Now that we have the recommendation table containing IDs for both query and recommended articles, we can easily return the full metadata for the top k recommendations for any given article."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 19,
            "metadata": {
                "scrolled": false
            },
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<style  type=\"text/css\" >\n",
                            "</style><table id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8\" ><thead>    <tr>        <th class=\"blank level0\" ></th>        <th class=\"col_heading level0 col0\" >rank</th>        <th class=\"col_heading level0 col1\" >similarity_score</th>        <th class=\"col_heading level0 col2\" >title</th>        <th class=\"col_heading level0 col3\" >authors</th>        <th class=\"col_heading level0 col4\" >journal</th>        <th class=\"col_heading level0 col5\" >publish_time</th>        <th class=\"col_heading level0 col6\" >url</th>    </tr></thead><tbody>\n",
                            "                <tr>\n",
                            "                        <th id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8level0_row0\" class=\"row_heading level0 row0\" >0</th>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row0_col0\" class=\"data row0 col0\" >1</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row0_col1\" class=\"data row0 col1\" >0.142033</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row0_col2\" class=\"data row0 col2\" >Cynomolgus Macaque as an Animal Model for Severe Acute Respiratory Syndrome</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row0_col3\" class=\"data row0 col3\" >Lawler, James V; Endy, Timothy P; Hensley, Lisa E; Garrison, Aura; Fritz, Elizabeth A; Lesar, May; Baric, Ralph S; Kulesh, David A; Norwood, David A; Wasieloski, Leonard P; Ulrich, Melanie P; Slezak, Tom R; Vitalis, Elizabeth; Huggins, John W; Jahrling, Peter B; Paragas, Jason</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row0_col4\" class=\"data row0 col4\" >PLoS Med</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row0_col5\" class=\"data row0 col5\" >2006-04-18</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row0_col6\" class=\"data row0 col6\" ><a href=\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435788/\">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435788/</a></td>\n",
                            "            </tr>\n",
                            "            <tr>\n",
                            "                        <th id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8level0_row1\" class=\"row_heading level0 row1\" >1</th>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row1_col0\" class=\"data row1 col0\" >2</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row1_col1\" class=\"data row1 col1\" >0.117743</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row1_col2\" class=\"data row1 col2\" >SARS: Systematic Review of Treatment Effects</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row1_col3\" class=\"data row1 col3\" >Stockman, Lauren J; Bellamy, Richard; Garner, Paul</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row1_col4\" class=\"data row1 col4\" >PLoS Med</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row1_col5\" class=\"data row1 col5\" >2006-09-12</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row1_col6\" class=\"data row1 col6\" ><a href=\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1564166/\">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1564166/</a></td>\n",
                            "            </tr>\n",
                            "            <tr>\n",
                            "                        <th id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8level0_row2\" class=\"row_heading level0 row2\" >2</th>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row2_col0\" class=\"data row2 col0\" >3</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row2_col1\" class=\"data row2 col1\" >0.100325</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row2_col2\" class=\"data row2 col2\" >A Network Integration Approach to Predict Conserved Regulators Related to Pathogenicity of Influenza and SARS-CoV Respiratory Viruses</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row2_col3\" class=\"data row2 col3\" >Mitchell, Hugh D.; Eisfeld, Amie J.; Sims, Amy C.; McDermott, Jason E.; Matzke, Melissa M.; Webb-Robertson, Bobbi-Jo M.; Tilton, Susan C.; Tchitchek, Nicolas; Josset, Laurence; Li, Chengjun; Ellis, Amy L.; Chang, Jean H.; Heegel, Robert A.; Luna, Maria L.; Schepmoes, Athena A.; Shukla, Anil K.; Metz, Thomas O.; Neumann, Gabriele; Benecke, Arndt G.; Smith, Richard D.; Baric, Ralph S.; Kawaoka, Yoshihiro; Katze, Michael G.; Waters, Katrina M.</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row2_col4\" class=\"data row2 col4\" >PLoS One</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row2_col5\" class=\"data row2 col5\" >2013-07-25</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row2_col6\" class=\"data row2 col6\" ><a href=\"https://doi.org/10.1371/journal.pone.0069374; https://www.ncbi.nlm.nih.gov/pubmed/23935999/\">https://doi.org/10.1371/journal.pone.0069374; https://www.ncbi.nlm.nih.gov/pubmed/23935999/</a></td>\n",
                            "            </tr>\n",
                            "            <tr>\n",
                            "                        <th id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8level0_row3\" class=\"row_heading level0 row3\" >3</th>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row3_col0\" class=\"data row3 col0\" >4</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row3_col1\" class=\"data row3 col1\" >0.076779</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row3_col2\" class=\"data row3 col2\" >Genome Wide Identification of SARS-CoV Susceptibility Loci Using the Collaborative Cross</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row3_col3\" class=\"data row3 col3\" >Gralinski, Lisa E.; Ferris, Martin T.; Aylor, David L.; Whitmore, Alan C.; Green, Richard; Frieman, Matthew B.; Deming, Damon; Menachery, Vineet D.; Miller, Darla R.; Buus, Ryan J.; Bell, Timothy A.; Churchill, Gary A.; Threadgill, David W.; Katze, Michael G.; McMillan, Leonard; Valdar, William; Heise, Mark T.; Pardo-Manuel de Villena, Fernando; Baric, Ralph S.</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row3_col4\" class=\"data row3 col4\" >PLoS Genet</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row3_col5\" class=\"data row3 col5\" >2015-10-09</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row3_col6\" class=\"data row3 col6\" ><a href=\"https://doi.org/10.1371/journal.pgen.1005504; https://www.ncbi.nlm.nih.gov/pubmed/26452100/\">https://doi.org/10.1371/journal.pgen.1005504; https://www.ncbi.nlm.nih.gov/pubmed/26452100/</a></td>\n",
                            "            </tr>\n",
                            "            <tr>\n",
                            "                        <th id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8level0_row4\" class=\"row_heading level0 row4\" >4</th>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row4_col0\" class=\"data row4 col0\" >5</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row4_col1\" class=\"data row4 col1\" >0.074392</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row4_col2\" class=\"data row4 col2\" >A Porcine Epidemic Diarrhea Virus Outbreak in One Geographic Region of the United States: Descriptive Epidemiology and Investigation of the Possibility of Airborne Virus Spread</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row4_col3\" class=\"data row4 col3\" >Beam, Andrea; Goede, Dane; Fox, Andrew; McCool, Mary Jane; Wall, Goldlin; Haley, Charles; Morrison, Robert</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row4_col4\" class=\"data row4 col4\" >PLoS One</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row4_col5\" class=\"data row4 col5\" >2015-12-28</td>\n",
                            "                        <td id=\"T_677d096a_840e_11eb_9516_f18ad58d3cc8row4_col6\" class=\"data row4 col6\" ><a href=\"https://doi.org/10.1371/journal.pone.0144818; https://www.ncbi.nlm.nih.gov/pubmed/26709512/\">https://doi.org/10.1371/journal.pone.0144818; https://www.ncbi.nlm.nih.gov/pubmed/26709512/</a></td>\n",
                            "            </tr>\n",
                            "    </tbody></table>"
                        ],
                        "text/plain": [
                            "<pandas.io.formats.style.Styler at 0x7fa916f3a390>"
                        ]
                    },
                    "execution_count": 19,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "# Display selected metadata for the top k recommendations of the article of interest\n",
                "cols_to_keep = ['title', 'authors', 'journal', 'publish_time', 'url']\n",
                "recommender.get_top_k_recommendations(metadata_public, article_of_interest, cols_to_keep)"
            ]
        },
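        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "As a rough illustration of how a recommendation table can be combined with article metadata, the cell below joins two small made-up tables with pandas. The tables `toy_recs` and `toy_metadata`, their IDs, titles and scores are hypothetical, and the join is only a sketch of the idea, not the implementation of `get_top_k_recommendations`."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "\n",
                "# Hypothetical toy tables that mirror the column names used above\n",
                "toy_recs = pd.DataFrame({\n",
                "    'cord_uid': ['a1', 'a1'],\n",
                "    'rec_rank': [1, 2],\n",
                "    'rec_score': [0.9, 0.5],\n",
                "    'rec_cord_uid': ['b2', 'c3'],\n",
                "})\n",
                "toy_metadata = pd.DataFrame({\n",
                "    'cord_uid': ['a1', 'b2', 'c3'],\n",
                "    'title': ['Query paper', 'Closest paper', 'Second paper'],\n",
                "})\n",
                "\n",
                "# Attach each recommended article's metadata by matching rec_cord_uid to cord_uid\n",
                "toy_joined = toy_recs.merge(\n",
                "    toy_metadata, left_on='rec_cord_uid', right_on='cord_uid', suffixes=('', '_rec')\n",
                ").sort_values('rec_rank')\n",
                "print(toy_joined[['rec_rank', 'rec_score', 'title']])"
            ]
        },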
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Conclusion\n",
                "In this notebook, we have demonstrated how to create a TF-IDF recommender that returns the top k (in this case 5) articles most similar in content to an article of interest (in this example, the article with `cord_uid='ej795nks'`)."
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.6.11"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}
